% Built off of material by Chris Krogslund

By way of introduction…

The data

For more info, see The Guardian Datablog Olympic medal winners: every one since 1896 as open data

First, we’ll use dplyr and tidyr to count the medals of each type won, for each country and each year.

head(medals)
##     City    Sport Discipline               Athlete NOC Gender
## 1 Athens Aquatics   Swimming         HAJOS, Alfred HUN    Men
## 2 Athens Aquatics   Swimming      HERSCHMANN, Otto AUT    Men
## 3 Athens Aquatics   Swimming     DRIVAS, Dimitrios GRE    Men
## 4 Athens Aquatics   Swimming    MALOKINIS, Ioannis GRE    Men
## 5 Athens Aquatics   Swimming    CHASAPIS, Spiridon GRE    Men
## 6 Athens Aquatics   Swimming CHOROPHAS, Efstathios GRE    Men
##                        Event Event_gender  Medal Year Country
## 1             100m freestyle            M   Gold 1896 Hungary
## 2             100m freestyle            M Silver 1896 Austria
## 3 100m freestyle for sailors            M Bronze 1896  Greece
## 4 100m freestyle for sailors            M   Gold 1896  Greece
## 5 100m freestyle for sailors            M Silver 1896  Greece
## 6            1200m freestyle            M Bronze 1896  Greece
# dplyr and tidyr refresher
medal_counts <- medals %>% group_by(Medal, Year, Country) %>% summarise(count = n())
head(medal_counts)
## Source: local data frame [6 x 4]
## Groups: Medal, Year [1]
## 
##    Medal  Year Country count
##    <chr> <int>   <chr> <int>
## 1 Bronze  1896 Austria     2
## 2 Bronze  1896 Denmark     3
## 3 Bronze  1896  France     2
## 4 Bronze  1896 Germany     2
## 5 Bronze  1896  Greece    22
## 6 Bronze  1896 Hungary     3

This table is in tidy format. Wide (untidy) format can be useful for plotting in base plot (more on this later)

medal_counts_wide <- medal_counts %>% spread(key = Medal, value = count) %>%
  ungroup() %>%
  mutate(Bronze = ifelse(is.na(Bronze), 0, Bronze)) %>% 
  mutate(Silver = ifelse(is.na(Silver), 0, Silver)) %>% 
  mutate(Gold = ifelse(is.na(Gold), 0, Gold)) 
head(medal_counts_wide)
## # A tibble: 6 x 5
##    Year   Country Bronze  Gold Silver
##   <int>     <chr>  <dbl> <dbl>  <dbl>
## 1  1896 Australia      0     2      0
## 2  1896   Austria      2     2      1
## 3  1896   Denmark      3     1      2
## 4  1896    France      2     5      4
## 5  1896   Germany      2    26      5
## 6  1896    Greece     22    10     20

Finally, let’s subset the data to gold medal counts for the US, for easier plotting.

usa_gold_medals <-  medal_counts %>% 
  filter(Medal == "Gold") %>%
  filter(Country == "United States")

Base graphics

The general call for base plot looks something like this:

plot(x = , y = , ...)

Additional parameters can be passed in to customize the plot:

More layers can be added to the plot with additional calls to lines, points, text, etc.

plot(medal_counts_wide$Year, medal_counts_wide$Gold) # Basic

plot(usa_gold_medals$Year, usa_gold_medals$count, type = "l", 
     main = "USA Gold Medals",
     xlab = "Year", ylab = "Count") # with updated parameters
points(x = 1984, y = usa_gold_medals$count[usa_gold_medals$Year == 1984], 
       col = "red", pch = 16)

Other plot types

These are just a few other types of plots you can make in base graphics.

boxplot(Gold~Year, data = medal_counts_wide)

hist(medal_counts_wide$Gold)

plot(density(medal_counts_wide$Gold))

barplot(usa_gold_medals$count, width = 4, names.arg = usa_gold_medals$Year, main = "USA Gold Medals")

mosaicplot(Year~Medal, medal_counts)

Object-oriented plots

  • Base graphics often recognizes the object type and will implement specific plot methods
  • lattice and ggplot2 generally don’t exhibit this sort of behavior
medal_lm <- lm(Gold ~ Bronze + Silver, data = medal_counts_wide)
plot(medal_counts_wide %>% select(-Country)) # Calls plotting method for class of the dataset ("data.frame")

plot(medal_lm, which=1:2) # Calls plotting method for class of medal_lm object ("lm"), print first two plots only

Pros/cons of base graphics

  • Good for exploratory data analysis and sanity checks
  • Syntax is inconsistent across functions: some take x,y while others take formulas
  • Defaults plotting parameters are ugly and it can be difficult to customize

lattice and ggplot2

lattice is

  1. faster (though only noticeable over many and large plots)

  2. simpler (at first)

  3. better at trellis graphs

  4. able to do 3d graphs

ggplot2 is

  1. generally more elegant

  2. more syntactically logical (and therefore simpler, once you learn it)

  3. better at grouping

  4. able to interface with maps

Basic usage: lattice

The general call for lattice graphics looks something like this:

graph_type(formula, data = , [options])

The most common graph types:

The specifics of the formula differ for each graph type, but the general format is straightforward

~y             # Show the distribution of y

y~x           # Show the relationship between x and y 

y~x|A         # Show the relationship between x and y conditional on the values of A

y~x|A*B       # Show the relationship between x and y conditional on the combinations of A and B

z~y*x         # Show the 3D relationship between x, y, and z

Let’s recreate some of the plots we made above using lattice. wlattice overlap with the options for base graphics.

xyplot(Gold~Year, data = medal_counts_wide, main = "Gold Medal Counts")

xyplot(count~Year, data = usa_gold_medals, type = "l", main = "USA Gold Medals")

# Boxplot
bwplot(Year~Gold, data = medal_counts_wide, main = "Gold Medal Counts") # Labels are lost when lattice automatically converts Year to a factor

# Flip the coordinates so year increases on the x-axis
bwplot(Gold~Year, data = medal_counts_wide, main = "Gold Medal Counts", horizontal = FALSE) 

# Histogram
histogram(~Gold, medal_counts_wide, type="count", main = "Gold Medal Counts")

# Density plot
densityplot(~Gold, medal_counts_wide, main = "Gold Medal Counts")

# Bar chart
barchart(count~Year, data = usa_gold_medals, width = 4, main = "USA Gold Medals", horizontal = FALSE)

Pros/cons of lattice

  • lattice is useful for comparing cross-sectional variation in trellis plots
  • Plots are generated in a single function call: you can’t go back and add more once it’s created
  • Becoming a proficient user of lattice requires learning a huge array of graph-specific formulas and options

Basic usage: ggplot2

The general call for ggplot2 graphics looks something like this:

ggplot(data = , aes(x = ,y = , [options])) + geom_xxxx() + ... + ... + ...

Note that ggplot2 graphs in layers in a continuing call (hence the endless +…+…+…), which really makes the extra layer part of the call

...+geom_xxxx(data = , aes(x = ,y = ,[options]),[options])+...+...+...

You can see the layering effect by comparing the same graph with different colors for each layer

p <- ggplot(data = medal_counts_wide, aes(x = Year, y = Gold)) + geom_point(color = "gold")
p

p + geom_point(aes(x = Year, y = Silver), color = "gray") + ylab("Medals")

Grammar of Graphics

ggplot2 syntax is very different from base graphics and lattice. It’s built on the grammar of graphics. The basic idea is that the visualization of all data requires four items:

  1. One or more statistics conveying information about the data (identities, means, medians, etc.)

  2. A coordinate system that differentiates between the intersections of statistics (at most two for ggplot, three for lattice)

  3. Geometries that differentiate between off-coordinate variation in kind

  4. Scales that differentiate between off-coordinate variation in degree

ggplot2 allows the user to manipulate all four of these items.

ggplot(medal_counts_wide, aes(x = Year, y = Gold)) + geom_point() + ggtitle("Gold Medal Counts")

ggplot(usa_gold_medals, aes(x = Year, y = count)) + geom_line() + ggtitle("USA Gold Medals")

# Boxplots
ggplot(medal_counts_wide, aes(x = factor(Year), y = Gold)) + geom_boxplot() + ggtitle("Gold Medal Counts")

# Histogram
ggplot(medal_counts_wide, aes(x = Gold)) + geom_histogram() + ggtitle("Gold Medal Counts")

# Density plot
ggplot(medal_counts_wide, aes(x = Gold)) + geom_density() + ggtitle("Gold Medal Counts")

# Bar chart
ggplot(usa_gold_medals, aes(x = Year, y = count)) + geom_bar(stat = "identity")

ggplot2 and tidy data

  • Base graphics/lattice and ggplot2 have one big difference: ggplot2 requires your data to be in tidy format. For base graphics, it can actually be helpful not to have your data in tidy format. Compare:
usa_all_medals <- medal_counts %>% filter(Country == "United States")
usa_all_medals_untidy <- medal_counts_wide %>% filter(Country == "United States") # Medal is the variable, but it spans multiple columns

# Base graphics call
plot(usa_all_medals_untidy$Year, usa_all_medals_untidy$Gold, col = "green", type = "l")
lines(usa_all_medals_untidy$Year, usa_all_medals_untidy$Silver, col = "blue")
lines(usa_all_medals_untidy$Year, usa_all_medals_untidy$Bronze, col = "red")
legend("right", legend = c("Gold", "Silver", "Bronze"), fill = c("green", "blue", "red"))

# ggplot2 call
ggplot(data = usa_all_medals, aes(x = Year, y = count)) + geom_line(aes(color = Medal))

  • The difference is that ggplot treats Medal as an aesthetic parameter that differentiates kinds of statistics. Base graphics treats each medal type as a layer to the plot.
  • ggplot2 plays nice with dplyr and pipes. If you want to manipulate your data specifically for one plot but not save the new dataset, you can call your dplyr chain and pipe it directly into a ggplot call:
# Assuming you've already subsetted your data to USA Gold medals
ggplot(usa_gold_medals, aes(x = Year, y = count)) + geom_line()

# This combines the subsetting and plotting into one step
medal_counts %>% 
  filter(Medal == "Gold") %>%
  filter(Country == "United States") %>%
  ggplot(aes(x = Year, y = count)) + geom_line()

Pros/cons of ggplot2

  • Allows you to add features in “layers”
  • Automatically adjusts spacing and sizing as you add more layers
  • Requires data to be in tidy format
  • Syntax is different from base R – there is a learning curve

Comparing lattice and ggplot2

lattice v. ggplot2: Densities

ggplot(data = usa_gold_medals, aes(x = count)) + geom_density() # ggplot2
densityplot(~count, data = usa_gold_medals) # lattice

lattice v. ggplot2: X-Y scatter plots

ggplot(data = usa_gold_medals, aes(x = Year, y = count)) + geom_point() # ggplot2
xyplot(count~Year, data = usa_gold_medals) # lattice

lattice v. ggplot2: X-Y line plots

ggplot(data = usa_gold_medals, aes(x = Year, y = count)) + geom_line() # ggplot2
xyplot(count~Year, data = usa_gold_medals, type = "l") # lattice

lattice v. ggplot2: bar plots

# Create a dataframe of median number of gold medals by country
median_gold_medals <- medal_counts %>% 
  filter(Medal == "Gold") %>% 
  group_by(Country) %>%
  summarise(med = median(count))

ggplot(data = median_gold_medals[1:15, ], aes(x = Country, y = med)) + geom_bar(stat="identity") # ggplot2
barchart(med~Country, data = median_gold_medals[1:15, ]) # lattice

lattice v. ggplot2: boxplots

# Notice that here, you must explicitly convert numeric years to factors
ggplot(data = medal_counts_wide, aes(x = factor(Year), y = Gold)) + geom_boxplot() # ggplot2
bwplot(Gold~factor(Year), data = medal_counts_wide, horizontal = FALSE) # lattice

lattice v. ggplot2: “trellis” plots

# Subset the data to North America countries for easier viewing
northern_hem <- medal_counts_wide %>% filter(Country %in% c("United States", "Canada", "Cuba", "Mexico"))

ggplot(data = northern_hem, aes(x = Year, y = Gold)) + geom_point() + facet_wrap(~Country) # ggplot2
xyplot(Gold~Year|Country, data = northern_hem) # lattice

lattice v. ggplot2: countour plots

data(volcano) # Load volcano contour data
volcano[1:10, 1:10] # Examine volcano dataset (first 10 rows and columns)
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]  100  100  101  101  101  101  101  100  100   100
##  [2,]  101  101  102  102  102  102  102  101  101   101
##  [3,]  102  102  103  103  103  103  103  102  102   102
##  [4,]  103  103  104  104  104  104  104  103  103   103
##  [5,]  104  104  105  105  105  105  105  104  104   103
##  [6,]  105  105  105  106  106  106  106  105  105   104
##  [7,]  105  106  106  107  107  107  107  106  106   105
##  [8,]  106  107  107  108  108  108  108  107  107   106
##  [9,]  107  108  108  109  109  109  109  108  108   107
## [10,]  108  109  109  110  110  110  110  109  109   108
volcano3d <- melt(volcano) # Use reshape2 package to melt the data into tidy form
head(volcano3d) # Examine volcano3d dataset (head)
##   Var1 Var2 value
## 1    1    1   100
## 2    2    1   101
## 3    3    1   102
## 4    4    1   103
## 5    5    1   104
## 6    6    1   105
names(volcano3d) <- c("xvar", "yvar", "zvar") # Rename volcano3d columns

ggplot(data = volcano3d, aes(x = xvar, y = yvar, z = zvar)) + geom_contour() # ggplot2
contourplot(zvar~xvar + yvar, data = volcano3d) # lattice

lattice v. ggplot2: tile/image/level plots

ggplot(data = volcano3d, aes(x = xvar, y = yvar, z = zvar)) + geom_tile(aes(fill = zvar)) # ggplot2
levelplot(zvar~xvar + yvar, data = volcano3d) # lattice

lattice: 3D plots

# Create a subset of the dataset containing only data for France
cloud(Gold~Bronze*Silver, data = northern_hem)

cloud(Gold~Bronze*Silver|Country, data = northern_hem)

lattice v. ggplot2: options

Main difference to keep in mind: lattice options behave similarly to base plot, ggplot handles options in additional layers.

Labels

ggplot(data = usa_gold_medals, aes(x = Year, y = count)) + geom_point() + 
  xlab(label = "Year") + 
  ylab(label = "Number of Gold Medals Won") + 
  ggtitle(label = "Cool Graph") # ggplot2

xyplot(count~Year, data = usa_gold_medals, 
       xlab = "Year", ylab = "Number of Gold Medals Won", 
       main = "Cool Graph") # lattice

Axis and point scales

ggplot(data = usa_gold_medals, aes(x = Year, y = count)) + geom_point() # ggplot2
ggplot(data = usa_gold_medals, aes(x = Year, y = count)) + geom_point(size=3) # ggplot2
ggplot(data = usa_gold_medals, aes(x = Year, y = count)) + geom_point(size=1) # ggplot2

xyplot(count~Year, data = usa_gold_medals) # lattice
xyplot(count~Year, data = usa_gold_medals, cex = 2) # lattice
xyplot(count~Year, data = usa_gold_medals, cex = .5) # lattice

Graphical parameters

Colors

ggplot(data = usa_gold_medals, aes(x = Year, y = count)) + geom_point(color = colors()[11]) # ggplot2
ggplot(data = usa_gold_medals, aes(x = Year, y = count)) + geom_point(color = "red") # ggplot2

xyplot(count~Year, data = usa_gold_medals, col = colors()[11]) #lattice
xyplot(count~Year, data = usa_gold_medals, col = "red") #lattice

Anatomy of aes()

ggplot(data = , aes(x = , y = , color = , linetype = , shape = , size=))

These four aesthetic parameters (color, linetype, shape, size) can be used to show variation in kind (categories) and variation in degree (numeric).

Parameters passed into aes should be variables in your dataset.

Parameters passed to geom_xxx outside of aes should not be related to your dataset - they apply to the whole figure.

ggplot(data = usa_all_medals, aes(x = Year, y = count)) + geom_line(aes(color = Medal))

Note what happens when we specify the color parameter outside of the aesthetic operator. ggplot2 views these specifications as invalid graphical parameters.

ggplot(data = usa_all_medals, aes(x = Year, y = count)) + geom_point(color = Medal)
## Error in layer(data = data, mapping = mapping, stat = stat, geom = GeomPoint, : object 'Medal' not found
ggplot(data = usa_all_medals, aes(x = Year, y = count)) + geom_point(color = "Medal")
## Error in grDevices::col2rgb(colour, TRUE): invalid color name 'Medal'
ggplot(data = usa_all_medals, aes(x = Year, y = count)) + geom_point(color = "red")

Point Styles and Widths

ggplot(data = usa_gold_medals, aes(x = Year, y = count)) + geom_point(shape = 3) # ggplot2
ggplot(data = usa_gold_medals, aes(x = Year, y = count)) + geom_point(shape = "w") # ggplot2
ggplot(data = usa_gold_medals, aes(x = Year, y = count)) + geom_point(shape = "$", size=5) # ggplot2

xyplot(count~Year, data = usa_gold_medals, pch = 3) # lattice
xyplot(count~Year, data = usa_gold_medals, pch = "w") # lattice
xyplot(count~Year, data = usa_gold_medals, pch = "$", cex = 2) # lattice

Line Styles and Widths

ggplot(data = usa_gold_medals, aes(x = Year, y = count)) + geom_line(linetype = 1) # ggplot2
ggplot(data = usa_gold_medals, aes(x = Year, y = count)) + geom_line(linetype = 2) # ggplot2
ggplot(data = usa_gold_medals, aes(x = Year, y = count)) + geom_line(linetype = 5, size = 2) # ggplot2

xyplot(count~Year, data = usa_gold_medals, type = "l", lty = 1) # lattice
xyplot(count~Year, data = usa_gold_medals, type = "l", lty = 2) # lattice
xyplot(count~Year, data = usa_gold_medals, type = "l", lty = 3, lwd = 3) # lattice

Using aesthetics to highlight features

Differences in kind

ggplot(data = northern_hem, aes(x = Year, y = Gold)) + geom_line(aes(linetype = Country))
ggplot(data = northern_hem, aes(x = Year, y = Gold)) + geom_line(aes(shape = Country))

Differences in degree

ggplot(data = northern_hem, aes(x = Year, y = Silver)) + geom_point(aes(color = Gold))
ggplot(data = northern_hem, aes(x = Year, y = Silver)) + geom_point(aes(size = Gold))

Multiple non-coordinate aesthetics (differences in kind using color, degree using point size)

ggplot(data = northern_hem, aes(x = Year, y = Silver)) + geom_point(aes(size = Gold, color = Country))

Fitted lines and curves with ggplot2

ggplot(data = usa_gold_medals, aes(x = Year, y = count)) + geom_point()

# Add linear model (lm) smoother
ggplot(data = usa_gold_medals, aes(x = Year, y = count)) + geom_point() + 
  geom_smooth(method = "lm")

# Add local linear model (loess) smoother, span of 0.75 (more smoothed)
ggplot(data = usa_gold_medals, aes(x = Year, y = count)) + geom_point() + 
  geom_smooth(method = "loess", span = .75)

# Add local linear model (loess) smoother, span of 0.25 (less smoothed)
ggplot(data = usa_gold_medals, aes(x = Year, y = count)) + geom_point() + 
  geom_smooth(method = "loess", span = .25)

# Add linear model (lm) smoother, no standard error shading 
ggplot(data = usa_gold_medals, aes(x = Year, y = count)) + geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

# Add local linear model (loess) smoother, no standard error shading 
ggplot(data = usa_gold_medals, aes(x = Year, y = count)) + geom_point() + 
  geom_smooth(method = "loess", se = FALSE)

# Add a local linear (loess) smoother for each medal, no standard error shading
ggplot(data = usa_all_medals, aes(x = Year, y = count)) + 
  geom_point(aes(color = Medal)) + 
  geom_smooth(aes(color = Medal), method = "loess", se = FALSE)

lattice v. ggplot2: Combining Multiple Plots

# Initialize gridExtra library
library(gridExtra)

# Create 3 plots to combine in a table
plot1 <- ggplot(data = medal_counts_wide, aes(x = Year, y = Gold)) + 
  geom_point(color = "gold") + 
  geom_point(aes(x = Year, y = Silver), color = "gray") + 
  geom_point(aes(x = Year, y = Bronze), color = "brown") + 
  ylab("Medals")
plot2 <- ggplot(data = usa_all_medals, aes(x = Year, y = count)) + geom_line(aes(color = Medal))
plot3 <- ggplot(data = northern_hem, aes(x = Year, y = Gold)) + geom_point(aes(color = Country))


# Call grid.arrange
grid.arrange(plot1, plot2, plot3, nrow=3, ncol = 1)

Exporting

Two basic image types:

Raster/Bitmap (.png, .jpeg)

Every pixel of a plot contains its own separate coding; not so great if you want to resize the image

jpeg(filename="example.png", width=, height=)
plot(x,y)
dev.off()

Vector (.pdf, .ps)

Every element of a plot is encoded with a function that gives its coding conditional on several factors; great for resizing

pdf(filename="example.pdf", width=, height=)
plot(x,y)
dev.off()

Exporting with lattice v. ggplot

# Assume we saved our plot is an object called example.plot

# lattice
trellis.device(device="pdf", filename="example.pdf")
print(example.plot)
dev.off()

# ggplot2
ggsave(filename="example.pdf", plot=example.plot, scale=, width=, height=) # ggplot2

Breakout

You’re welcome to try out either lattice or ggplot for these questions, but in the solutions we’ll focus on the ggplot approach.

For some of these you may want to use a smaller version of the dataset, such as a random subset, subset <- air[sample(1:nrow(air), 10000, replace = FALSE), ].

Basics

  1. Plot a histogram of the flight delays with negative delays set to zero, censoring delay times at a maximum of 60 minutes.

  2. Plot the arrival delay against the departure delay as a scatterplot.

  3. Clean up your scatterplot with a title and axis labels. Output it as a PDF and see if you’d be comfortable with including it in a report/paper.

  4. Make a boxplot of departure delays conditional on the day of week.

Using the ideas

  1. Create a trellis plot of departure delay boxplots, one per destination for this subset of destinations, DestSubset <- c('LAX','SEA','PHX','DEN','MSP','JFK','ATL','DFW','IAH', 'ORD'). Use a 2x5 layout of panels in the plot.

  2. Subset the data to flights going to Chicago (ORD) and Houston (IAH). Plot arrival delay against scheduled departure time (CRSDepTime). Now plot so that flights to Chicago are in one color and those to Houston in another. Use scale_x_continuous() and scale_y_continuous() to set the x-axis limits to be in the range from 6 am to midnight and the y-axis limits to be in the range (-10, 120).

Advanced

  1. Create a trellis plot where, for a given destination (see the subset in question 5), each panel uses a) hollow circles to plot departure delay as a function of time of day, and b) a red loess smoother without standard errors to plot the trend in departure delay over time of day. Limit the time of day shown to 6 am to midnight, and turn off the grey background. Figure out how to use partially-transparent points to reduce the effect of the overplotting of points.